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ABSTRACT 

This paper describes two models of the cost of data movement 
in parallel numerical algorithms. One model is a generalization of 
an approach due to Hockney, and is suitable for shared memory 
multiprocessors where each processor has vector capabilities. The 
other model is applicable to highly parallel nonshared memory 
MIMD systems. In this second model, algorithm performance is 
characterized in terms of the communication network design. 
Techniques used in VLSI complexity theory are also brought in, and 
algorithm independent upper bounds on system performance are 
derived for several problems that are important to scientific com- 
putation. 


1. Introduction 

The traditional model of parallel algorithm analysis was motivated by a 
desire to explore the potential of parallelism. Thus the question was asked: 
given an unlimited number of processing elements and an infinite capacity to 
move and permute data, what is the fastest method to solve the problem under 
consideration? This has proven to be a fruitful area of research and much has 
been learned. However, with the appearance of the Illiac IV, the Cray 1 and the 
CDC Cyber 205, it was quickly realized that the design of data structures and the 
cost of processor to processor and processor to memory communication are 
critical ingredients in the design and analysis of practical algorithms. The goal 

of this paper is to highlight the role played by communication cost in the 
analysis of numerical algorithms. 

Because the spectrum of parallel architectures suitable for scientific com- 
putation is so broad, it is difficult to derive one analytical model of computation 

— 1 Research supported by the National Aeronautics and Space Administration under NASA 
Contract Nos. NASl-17070 and NAS 1-17130 while the authors were in residence at the Insti- 
Uite for Computer Applications in Science and Engineering, NASA Langley Research Center, 
Hampton, VA 23665. Primary support for the first author wa 3 provided by an IBM Faculty 
Development Grant. 
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that characterizes the performance of every machine. In this paper we restrict 
our attention to three families of machine architectures and describe an analyt- 
ical model of performance that is reasonably suited to each. In particular, our 
goal for each model is to characterize the effect of communication costs on sys- 
tem performance. In each model we give an estimate of effective efficiency or 
speed-up of a computation as a function the latency and bandwidth of the pro- 
cessor communication medium. 

rhe first model is that of a "medium scale" shared-memory multiprocessor, 
having perhaps 2 to 32 processors, with each processor capable of exploiting 
substantial local vector parallelism. Section 2 of this paper gives a formal 
description of the shared memory model and illustrates a method of analyzing 
algorithms for machines of this type. Several standard, but important, numeri- 
cal problems are studied and a number of alternate implementations are 
analyzed. In particular, it is shown that for machines which have two levels of 
parallelism the performance of algorithms depends strongly on the way in which 
the problem is partitioned to fit on the architecture. The performance of the 
algorithms is given as a function of global and local memory latencies, the speed 
of arithmetic operations, the number of processors, and the size of the problem. 

The second model is that of a highly parallel MIMD system where processors 
communicate through a large network and there is no shared memory. We 
assume here a number of processors ranging from perhaps 32 to a few thousand, 
but with processors of lesser power than in the shared memory model. Analysis 
and design of algorithms for such systems turns out to be significantly different 
than it is for the shared memory machines. In section 3 it is shown that the 
techniques used in VLSI complexity analysis can be used to derive reasonable 
upper bounds on speed-up and efficiency. The appropriate parameters for this 
analysis turn out to be the ratio of message transmission times to arithmetic 
speed, and the relation of the problem being solved to the topology of the com- 
munication network. By looking at specific algorithms it is shown that many of 
the derived upper bounds are exact. 

As a variant of this second architecture model, in section 4 we consider 
machines interconnected by packet switched communications networks. 
Analysis of algorithms for such machines is similar to analysis of algorithms for 
other non shared memory machines, except communication delays play a cen- 
tral role. The paper concludes with a discussion of the shortcomings of the 
approaches described here and suggests several directions where more work 
needs to be done. 

2. Shared Memory Machines 

One of the clearest trends in commercial systems is the trend toward mul- 
tiprocessor shared memory architectures (see Figure 2.1), where each proces- 
sor has either a pipelined multitasking or vector capability. This family of mul- 
tiprocessors includes the Cray X-MP, Cray-II, the HEP-I and HEP-// 1 and the ETA 
GF10. The proposed Cedar multiprocessor [GKLS83] may be viewed as a machine 
in this class, where each "processor" is in fact a cluster of smaller processors, 
in this section we consider the design and analysis of algorithms for such 
machines. We begin with a list of properties shared by many machines in this 
class. (Of course, no list will characterize every machine and the set of 
specification below should be considered only as an approximation to a family of 

— 1 The Hep [ and II machines belong in this class, though the model of analysis given below 
Coes not fully describe their performance. 




Figure 2.1. Shared Memory Multiprocessor 
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architectures.) 

1. There are p processors, with p roughly in the range 2Sp^32. 

2. All processors have equal access to shared memory and vectors may be of 
arbitrary length and stride z . 

3. Each processor also has a sizable local memory from which it can fetch vec- 
tors of arbitrary length and stride. 

4. Each processor can perform vector diadic operations (or vector triadic 
operations where one operand is a scalar) using operands in from either the 
local memory or global shared memory. The execution time for a vector 
operation of length n is 

r - 1 ( n + **#) 

if either operand is in global memory, while it is 

r~ l (n + n£) 

if both operands are in local memory. Here r ~ l is the asymptotic perfor- 
mance rate for one processor, and is the vector length required to 
achieve half the asymptotic performance rate, an idea due to Hockney and 
Jesshope [HoJe8l]. Here we assume local memory accesses have much less 
latency than global memory accesses, and thus 

n k « n $ 

The scale of this inequality depends on the machine. For a system that uses 
a network of log(p) stages to bring data from shared memory into local 
(cache) memory, one may have n% - nfa + C*log(p) for some constant C. 

In general, the ratio — v-might vary between 10 and 1000 . a 

n H 

5. The parallel execution ofp tasks onp processors is denoted by: 

pardo(i = l,p) 
task(i); 
endpar; 

Here task(i) is a procedure, block, or statement that is executed on the i-th 
processor. No assumptions will be made about the processor synchroniza- 
tion or task scheduling mechanisms other than that the execution order 
will be consistent with the serial data dependencies. 

Algorithm Design 

The most natural way to design algorithms for these systems is to employ 
what the mechanical and structural engineering community has called problem 
substructuring. In this approach, the problem is divided into a set of indepen- 
dent tasks, each operating on its own portion of the data structure. 

~T~r 

in soiue sy items performance may be severely degraded if two processors access the 
same vector or an access has non-unit stride. The algorithms that follow avoid multiple 
accesses, but to simplify the analysis, non-unit stride performance problems have been ig- 
nored. 

-• An alternate assumption would be that the latencies nfc and nS are equal but that the 
computation rates for operand from global and local memory differ, or that all vector opera- 
tions on global data must be "cached" to local memory before execution. An analytical 
mode, can be built from any such set of assumptions. 


To illustrate this idea we consider an example studied many times in the 
literature [HoJe8l],[Brka8l].[LaVo75],[Ston75],[HeU77]: that of solving T sys- 
tems of tridiagonal matrix equations each of size n. Our notation is as follows. 
Vectors and arrays will be denoted with capitol letters (X, Y) and problem 
instances will be denoted with a superscript. Scalars (and vector components) 
will be denoted by lower case letters (with subscripted positions). A range of 
superscripts or subscripts will be denoted by [i;j] where i is the first element 
and j is the last element. In general, we shall use superscripts to denote equa- 
tion numbers and subscripts to denote the row within the matrix of an equation. 
Let A be the tridiagonal matrix whose i th row has nonzero elements (6<, a it c t ). 
We seek the solutions of T tridiagonal systems. 

A*X* = Yl, j = l ,T 

We assume here, and through out this paper, that the matricies A are all sym- 
metric positive definite and can be factored without partied pivoting. 

The simplest algorithm is to divide the problem into p sets of problems 
each containing T/p subproblems. The standard vector algorithm is then run 
on each processor. 


simple( A, Y, X) 

Pardo(i = 1, p) 

begin (* on processor i do *) 
r = (i-l)*T/p+l; s = i*T/p; 


end; 

endpar; 


for j = 2 to n do begin 
mS- r * 1 = - 

a tii] = a ltd + m * c /r : ' 1 v 


v}Zi ] = vjl i* 1 + 
end; 

for j = n-1 downto 1 do 

x l r * ] = (y/ r:#1 - c/ r :*l*x£f])/a/ r:5] ; 


The indices r,s,j and vector m within each block are assumed to be local to the 
executing processor. If all data is stored and fetched from shared (global) 
memory (local memory used only for m) then the cost of this algorithm is 

T «mp.cu _ r -i( Qn _ 7 )(I_ + ng) 

On the other hand, if one first brings the matrix and data vectors down to the 
local memory and then solves the —problems there and copies the solution vec- 
tor back to global memory the cost is in terms of for the expression above, 
but it also includes the movement of 5 vectors (afcgj, 6 »J , efjgj, ^ 

, ^’"•’1) each of length — — to and from shared memory. 

T «mp.Ut = r - 1(8n -7)(£+ng)+5*(-^+ng) 

One useful method for comparing these two implementations of this algo- 
rithm is to compute the ’’effective efficiency” of each. Observe that the asymp- 
totic speed of our machine is r_p operations per second. The set of T tridiago- 
nals requires (8ti— 7)T operations. If the machine could be programmed to 
operate at 100% efficiency the execution time would be 
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fo pt — r -i (871—7)7’ 

P 

The effective efficiency of the simple algorithm operating from shared 
memory is defined by 


gsimp,GIi — T 0 ?* 


'primp, CM 


( 2 . 1 ) 


T 

For n »p, the algorithmic effective efficiency for the simple algorithm operat- 
ing from local memory is approximately 




8 / 13 


( 2 . 2 ) 


lt .H t SPnfi 

13 T 13 nT 

Thus, if T is large in relation to png. the latency cost for global memory access 
gets masked by the arithmetic, and the global memory scheme is superior. On 
the other hand if T is near ng, then the local memory method is superior. 

An alternative solution is to substructure the problem so that processor i 
eliminates variables (i— l)n/p + l through in/p—1 from each of the T systems 
The result is a set of T problems of size 2 p. (This is a variation on a parallel 
algorithm of Sameh and Kuck [Saku78].) 

Substructure^, Y.X); 

eliminate: pardo(i = 1, p) 

(* in processor i do *) 
for i = (i- 


j - (i’i)* n /P. + 2 to i*n/p - 1 do begin 

= aA:f I + 

WP =mW* b Luri 
yttp = y&i T] + 


V}1 r 

end; 
endpar; 
pado(i = 1, p) 

(* in processor i do *) 

fOF 1 ~ n n / p " 1 r d .^.Y ntc ? O’ 1 )* 11 /? + 2 do begin 



end; 
endpar; 

The resulting system is shown in Figure 2.2. The subsystem corresponding to 
rows (x-1 )*n/p and i*n/p - 1 for i = l,p can be solved by the "simple" 
method described above and the remaining variables earn be solved by the "back 
solve" process 

bk-solve: pardo(i = l,p) 

for j = f (i-l)*P/n+l to i*p/n-l do begin 

$3 if £>#3 ^*'“ + 

end; 

endpar; 
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for p=4 processors. 
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Figure 2.2 Substructured Elimin ation 
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As urning all vectors are fetched and stored in shared memory, the algorithm 
given above requires 

T? bat - G “(n,T) = 17(^--2)*rZ'(T + n? /z ) + (l6p-7)*r;'(j-+n? /2 ) + 4S 

time steps where S is the cost of processor synchronization. To do the same 
algorithm from local memory requires the downloading of 4 vectors of length 
nT/p for the original problem and 4 vectors of length T for the T subproblems 
of size p divided among p processors. The uploading of the subproblems 
requires one vector move of length T for each processor and uploading the final 
solution is a vector move of length nT/p. The penalty for doing the substruc- 
tured algorithm in local memories is then 

' 5*(— + T + 2 n£). 

P 

Define the multiprocessor speed-up for the simple algorithm from shared 
memory by the relation 

^<«.n = 

For the shared memory version, this value is approximated by 

P(T + ng,,) 

T + pnf/2 


For T - n 1/2 this gives a speed-up of less than 


_ 2 £. 


F or large n and ignoring 


(p + 1) 

S, the substructured algorithm that uses only shared memory has a speed-up of 
approximately 

8 

gwubstr _ 17 


I 


1+ 16 (T+pn c ) 

17n ( T+n° ) 

over .the single processor simple method. In the range T = nf /2 the substruc- 
! ,tured algorithm yields a speed-up of approximately p/ 2. In fact, the reader can 
’ verify that the point at which the simple algorithm is superior to the substruc- 
tured algorithm when both use only shared memory is 

4r„p 

« nTl • / 1TI H 
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T > £<8p - 17 )«£ - 


-S. 


9(n-2p) 

If we again consider effective efficiency, one finds that for n »p the approxi- 
mate performance is 

grubmt ,CU — 8/ 17 


2 _ G 


1 + 2LL + 16p 2 n 
T 17nT 


(2.3) 


jpsubst 


,Ui - 


8/ 22 


(2.4) 


1 + UJlL (I0p + 16p 8 )ng 
22 T BnT 

Figure 2.3 plots the effective efficiencies (equations 2. 1-2.4) for the four methods 
described above as a function of T for n = 84,000, p=32, n L = 10, n G = 1000. 
Observe that as T becomes large the simple shared memory method has the 
best asymptotic performance. On the other hand, for small T the local memory, 
substructured algorithm is clearly superior. Consequently, we find the choice of 
optimal algorithm depends heavily on the relation of problem parameters to 



Figure 2.3. Effective Efficiencies F^p-CU gompjji rmbat.cu FtU b*jji 
for n =64000. p = 32. »f = 10 . n g = looo ; pIotted •„*. func ^? 
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Hockney-Jessope parameters, TOj/g and the multiprocessor parameters p and S. 


ADI Iterations 


As an application of these results consider the solution of a system of finite 
difference equation arising from the solution of a partial differential equation on 
a t vo dimension square domain. The region is discretized as an to by to grid and 
the differential operator is approximated by a sparse matrix M of size to 2 by to 2 . 
To find an approximate solution to the partial differential equation requires that 
we solve the equation MX—Y where Y is a given to by to array of values. A com- 
mon technique used to solve for X array is to view M as approximately factored 
into a product A*B where the matrix A is a system of tridiagonal matrices link- 
ing the rows of the X array and B has the same structure but it links the 
columns of the X array. To find X = B~ l A~ l Y requires us to first solve to tridiag- 
*, onal systems 

A* Z* - Y* for li j «£ to 

where the superscript j refers to the jth column of the gird. Then the set of 
solution columns Z i is viewed as a set of vectors in the solution of another to tri- 
diagonal systems 


BjXj - Zj for 1 ss j ^ to 

This method is known as the Alternating Direction Implicit (ADI) method and is 
used in many applications [PeRa55]. We examine two solution schemes. 


Assume the components of the arrays are stored by rows. The most natural 
partitioning of the algorithm is to substructure the system ( A matrix , B matrix 

. and Y array) into blocks of size ^-by to. Let r t = ^-ands ( = ■ Iv *' 1 ? 71 . The 
block is the set of coefficients corresponding to ^-column equations and 

the block corresponds to components through s< of all to row equa- 

tions. By "downloading” the data 


Me S3 1 , els'. r[ ti{‘ ) 

into the local memory of processor i, one may use the "simple" algorithm to 
compute 


Z [r: * 1 = (A [r:a J)- 1 yt r: »i 

in each processor using only local memory. Then, without "uploading" the Z 
array back to shared memory, it is possible to use the substructured method to 
solve the row equations 


= (b m )-'z. 

The only use of the shared memory is to solve to reduced systems of size 2 p 
required to complete the substructured elimination. The total time to complete 
this is 


Step 1. Download data and solve column equations 

* r-'(8n-7)(2- + n6) 

Step 2. Do the substructured elimination and upload final results 
17 (~ 2 )rZ l (n +71%) + ( 16 p- 7 )( 2 -+ n£) + 8 (n + ng) + ( ~-+ ng ) 

The alternative is to use the simple algorithm for both column and row 
equations. Unfortunately, this requires that the partial solution vector Z be 
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moved back to shared memory and read back to local memory in transposed 
order. Based on our memory addressing assumptions this step requires a 
minimum of rZ x ^{n + n$) seconds. 

Step 2’. Transpose Z and use the simple method and upload final results. 
rj 1 (8n-7)(^+ n£) + 2rZ l (^- + n$) + rZ l ^{n + n$) 

To determine the effective efficiency observe that to compute B~ x A~ l Y 
requires at least 7 i( 16 ti— 14) operations. If the machine cam be programmed to 
run at 100% efficiency the execution time would be 

rJ*2L(l8n-14) 

P 

The asymptotic efficiency for the substructured algorithm is 50% (computed in 
the limit as n goes to infinity) and 62% for the transposed simple method.' In the 
case that n is small (near or below n$) the substructured algorithm is superior. 
T° illustrate this, consider the special case of p=:32 processors, njf = 1000 and 
n n = 10. Figure 2.4 depicts the efficiency as a function of n. The cross point at 
which both methods are equal is when n is approximately 12000. In general, one 
can show that the substructured algorithm is superior to the simple scheme 
when 

71 25 fail + fo- ^) n k) 

HI. Algorithm Analysis for Machines Based on Large Networks ; 

In this section we consider systems built from a large number of simple 
processors interconnected by a communication network. Each processor con- 
tains local memory, but there is no global shared memory. These architectures 
can be based on a variety of types of communications networks. These networks 
can be of fixed topology, such as a ring or mesh, or can be packet switched or 
circuit switched networks. Numerous examples exist. The Finite Element 
Machine [Jord78] is a mesh connected lattice of 36 processors. The Cal-Tech 
Cosmic Cube contains 64 processors connected as a binary 6-cube. The Non-Von 
(Columbia Univ.) is a tree of processors. The CHiP architecture is lattice of pro- 
cessors [Snyd82] interconnected via a circuit switching network which can be 
configured as any member of a large family of graphs. The Boolean Vector 
Machine (Duke University) is an implementation of the Cube Connected Cycle 
network [PrVi82], These machines are all non-shared memory MIMD architec- 
tures based on large communications networks. 

In all of the above machines, no shared memory is used, and all interpro- 
cessor communication takes place via explicit interprocessor communication 
steps. In other cases a processor connection network is used to emulate a 
crossbar switch. Examples in this category the CDC Cyber plus which is a net- 
work of four rings with 16 processors in each ring. Zmob (Univ. of Maryland) and 
Crystal (Univ of Wisconsin) are based on ring architectures. A host of other sys- 
tems share properties that are similar to the machines cited, but have a global 
address space and share data through a processor to memory permutation net- 
work. These systems include the TRAC system [LiTr77], PASM [Sieg79], 
Cedar [GKLS8 2], and the ULTRA Computer (GoSc82]. These systems are con- 
sidered in section 4. 

Our principal focus in this section is on nonshared memory machines where 
data movement is explicitly controlled by the processors and limited by the 
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Figure 2.4. Algorithmic Efficiency as a function of n for 
n% = 1000, nfc = 10 and p = 32. 
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topology of the. network. The primary result presented here is that techniques 
developed for VLSI complexity analysis can be used to derive algorithm indepen- 
dent upper bounds on speed-up and effective efficiency that depend only on the 
problem being solved and the network topology. The upper bounds are derived 
for three problems: Fourier Transforms, Tridiagonal systems of Equations, and 
two dimensional elliptic boundary value problems. To prove that many of the 
bounds are exact, we describe the optimal algorithms. The system hardware will 
be modeled by the following rules: 

1. The system is composed of p processors where p could be very large 

p & 32. ’ 

2. Each processor has a sizable local memory and there is no shared memory. 

3. Each arithmetic operation takes a seconds. Initiation or receipt of a data 
transmission requires 0 seconds per word of data, and receipt of a message 
can be done immediately after transmission, or at any time thereafter. 
Processors do not ’’overlap" communication with arithmetic. (This is the 
primary focus of section 4.) 

4. The processors communicate with each other along paths that correspond 
to edges of a fixed connection graph. If the graph is complete, a crossbar 
network is modeled. If the graph is not complete, communication between 
processors not connected by an edge must be broken down into a sequence 
of message transmissions between processors along a path connecting the 
origin and the destination processors. 

In the paragraphs that follow we examine the performance of algorithms 
that have been implemented on this class of architectures, paying particular 
attention to the structure and cost of communication. 

The Cost of Communication. 

Let A be a parallel algorithm and let M be a machine withp processors. We 
can describe the interconnection topology of the machine M as a graph G(M). 
Similarly, the data flow graph of the algorithm A can be defined as the graph 
G(A) which is the directed acyclic graph whose nodes represent the operations 
m A, and whose arcs represent operand data dependencies. By an implementa- 
tion of A on M we mean a mapping 

im:G(A) -* G{M) 

where the operations of G(A) are mapped to processors and the communication 
arcs map to G(M) in one of three possible ways. Let a and b be operators in 
G(A) that are connected by an arc c. Assume a and b are mapped to processors 
p a and p b respectively. The three possible mappings of the arc c are 

1. Processors p a and p b coincide. In this case arc c becomes a self-loop and 
no interprocessor communication is need to implement this arc. 

2. Processors p a and p b are connected by an arc in the graph G(M). In this 
casep a transmits a value top^ to implement arc c. 

3. Processors p a and p b are not connected in the graph G(M). In this case the 
arc c must be mapped into a path 

Pa ~ Pi. P2. ■ • ,Pk - Pb 

where p t is connected by an arc in G{M) top 1+1 . That is the communication 
^long arc c is mapped into a sequence of' interprocessor communications 
need to relay the message. 
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TTie time required to perform the communication represented by arc c will 
vary depending on which of these cases applies. In the first case, no communi- 
cation occurs. In the second case, the message must be sent and received 
requiring time 2fi. In the third case, where the message is relayed k times, the 
communication time becpmes 2fc/?. In this third case, performance of all pro- 
cessors along the path is degraded by the time spent forwarding messages. 

Let Tfi(a,p) be the time required to execute algorithm A on machine M. It 
is often the case that algorithms designed for this class of architecture take the 
form of a loop with two steps 

Repeat 

1. Permute the data via the Communication network. 

2. Execute a set of arithmetic functions in paralleL 
until done; 


(Though all parallel algorithms can be put in this form, many have optimal 
implementations that violate this structure. This case is briefly considered in 
the next section.) 7#(a,0) is the total amount of time required to execute step 2 
for all passes through the loop and 7*4(0./?) is the total amount of time required 
to complete the data routing. In this case we have 


T&M = Tj^a.O) + 7^(0./?) 

In the general case, I/O can be generated in one processor while another is 
engaged in arithmetic and messages are moving through the wires. In this case 
it is possible (but not trivial!) to show that the equality above becomes a ^ 
Hence, given a parallel program it is possible for us to put an upper-bound on 
execution time. For a given problem, we can ask what is the lower bound on the 
execution time for any algorithm running on machine M . Clearly, if we know the 
optimal serial algorithm SER, we have the bound 


l-rpSER ( a ,o) =5 7j|}(a,0). 

This bound is tight only if there is enough parallelism to exploit p processors. A 
technique devised by Thompson [ThomBO] (and extended by many others 
[Agga03], [BrGo82], [CaMoBl], [LeigBlj, [SavaBl], [ViulBO]) to study the area- 
time trade-on in VLSI design can be used to find lower bounds on T&(0 t (3). Define 
a bisector B of a graph G of n nodes to be a set of arcs of G whose removal 
separates G into two graphs of equal size ( i.e. if n is odd. one subgraph is size 

2 ~ + 2~ ant * °^ er * s °* s * ze 2 ^* Let &c be the number of arcs in the 
minimal bisector. The bisection bandwidth, 

b, 




/? - 

is the number of words per second that can cross any minimal bisector of G(M ). 
Let 5 be a minimal bisector of G(M) and im:G(A)->G(M) be an implementation 
of A on M with the following property 


If A has n inputs and n outputs then im maps — inputs and —outputs to 
each of the p processors. P P 

In other words, we assume the implementation "uniformly distributes" the 
problem. This condition implies the set Im _1 (5) is a bisector of the input and 
output nodes of G(A). Let be the size of the minimal bisector of G(A) for all 
algorithms that solve the given problem Pr. We then have 
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P 


r^-i 


>C{U) 


* TA( o,0). 


This inequality characterizes the communication "bottleneck" imposed by the 
network topology induced bandwidth constraints. Network delay can also play 
an important role. A function with n inputs and n outputs is said to be transi- 
tive if every output is a nontrivial function of every input. For any transitive 
function that has been uniformly distributed over p processors, the minimal 
time that information about each input can be propagated to each output is 
2 (3log(p). In this case we have 




+ /S max 


I °C( H) 


2 log(p) 




Values of bjy. have been derived for a wide variety of problems. We consider 
three problems important for scientific computation. 

1. FFT n : an FFT on n complex numbers X[0-,n — l]. 

2. TRiyi : The direct solution of a tridiagonal matrix of size n. 

The direct solution of the n linear equations obtained by a simple 
approximation (5 or 9 point stau*) of a second order elliptic boundairy value 
problem on a unit square discretized as a Vn by Vn grid. 

The first problem here is well known ([ThomBO]), but the second two have 
not been studied in this context and it is interesting to note that the same proof 
applies to ail three problems. 


Lemma 3. 1. 

Assume that n is even and each of the problems three FFT n . TRin . and 
Elln is solved by algorithms where the inputs X[0\m] and outputs y[0;m.] are 
equally distributed over all processors. Then all three problems are transitive 
functions of their inputs and 

b PFT n = n; 673^ = 2; = 2 Vn; 

Proof: 

Each of the problems above can be viewed as the problem of solving a 
matrix equation of the form 

Ax - y 

for a given n by n invertible matrix A. The basic idea is as follows: any bisector 
will divide the flow graph of an algorithm into two "machines" where one 
machine has one half the input vector, call it y u and the other machine has the 
other half, y z . The bisection width of the problem is defined to be the minimal 
amount of "information" about 1/1 that machine 1 must send to machine 2 plus 
the "information" about y z that machine 2 must send to machine 1. The theorem 
states that, for example, no algorithm exists for directly solving elliptic boun- 
dary value problems for which the the information flow between the two halves 
of tne system falls below 2 y/n words for all input vectors y. The formal proof 
requires a formal definition of information and algorithm. Assume, without 
much loss of generality, that A and y have values that are rational numbers. 
Define an algorithm to be any finite sequence of tests and branches and rational 
arithemetic operations (+,-,*,/). In other words, an algorithm is assumed to be 
a piecewise rational function of the inputs. The basic unit of information will be 
a rational number. 
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Let A denote the matrix obtained from A by permuting the rows and 
columns so that the components of x and y corresponding to machine 1 are the 

first —rows and the components of x and y corresonding to machine 2 form the 


bottom half of the system. The linear equation can now be written in the form 


A Xx - 
*2 


1 LI 


The inverse of A exists and, can be decomposed into blocks of size 
follows: 



n . n 
T by j-as 


or 


Xi = C l y 1 + Diy z i = 1,2. 

We pose the question: How much information about y z must be known to com- 
pute Xj? Let G(y z ) be the part of the algorithmin in machine 2 that encodes the 
minimal amount of information about the vector y z needed to compute x x given 
V i an d let F be algorithm in machine 1 used to compute x x given G(y z ). 

x \ = F(y x , G(y z )) 

Setting y x - 0, we have a new function F defined by 

= r(G(y z )) = F( 0. G(y z )). 

Bu then 


DiVz = F(G(y z )). 

If G returns k values it follows that 

k i rtmk{D x ) 

The same argument shows that the minimum amount of information about y x 
that one needs to compute x z is rank(D z ). The minimal bisection width b P is 
then given by 

min(rank(D 1 ) + rank (D z )) 

over all row and column permutations of the matrix A. Because the inverse of A 
exists and the blocks are of equal size, it can be shown that rank(Di) = rank{B i ) 
for i = 1,2. To complete the proof we make the following observations 

1 For a tridiagonal system Bi contains only one element and, hence has rank 


>’ 2 

I ■ 

■ ' 

3 


The block tridiagonal system obtained by the natural order of a finite 
difference operator has rank (Bi ) = n 54 which, the reader can verify, can not 
be reduced by any row or column permutations. 

The FFT matrix is composed of n column vectors that are of the form 

A * = [l.tf*. tf 2 *, • • • ^(n-Dtjr 


where is a primative root of unity- Selecting any — rows from any 
columns is easily shown to have rank QED 


71 
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Upper Bounds on Efficiency. 

Asymptotic lower bounds on the operation counts are well known for each of 
these problems. 

T’i 77 ’ = 2 n log(n), T™ = 8n-7. Tf 1 = Cn log(n) 

The constant 2 in the FFT algorithm assumes a complex operation requires only 
1 time unit. In terms of operations on real numbers only, the formula is 
3n log (n). The constant C depends on the special properties of the equation. In 
the case EU n we restrict our attention to the family of implementations of Fast 
Poisson Solvers [BuGo70, SCKu76]. In this case the lower bound 6 /P5n = b^ and 
the arithmetic serial complexity is 

Ti PS - 2n(log (n)+4) — 7 

The resulting lower bounds on parallel complexity can be expressed as upper 
bounds on effective efficiency. Letting r = f-the bound are 

ESP =£ 




EfF s s; 


To apply these bound to specific processor connections, we need only specify 
G(M) and determine bc{uy Figure 3.1 illustrates 5 graphs: 

Cam, the complete graph onp processors; &c(Q>w») = p. 

Shuf , a reduction of the shuffle-exchange graph on 2p processors obtained 
by identifying pairs of adjacent processors connected by exchange edges; 
b G{Shuf) - P- 

Ring, a ring ofp processors; b^j^) - 2. 

Tree, a tree ofp— 1 processors; 6<;(7V8,) = 1» 

Mesh, a by p^ octagonally connected grid of processors; bc(£a«/i) = p 

Given the Efficiencies, an upper bound on the speed-up Sfa can be derived 
using the relation S& = Ejfc. Using the relation SP& = pEft, one can derive an 
upper bound on speed-up performance for each of these network topologies. 
This information is given in Table 3.1. These bounds are exact, up to constant 
terms, as functions of n. Their primary short comings are that they often 
underestimate communication costs by a factor of log(p) for networks of large 
bandwidth. A better model of the delay seems to be needed. In the case of the 

FFT, the communication delay term is, in fact, --log(p) not just log(p) as we 

have indicated.) To show that some of these upper bounds are tight, we now con- 
struct algorithms and consider each case in turn. 





Shuf 
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Figure 3.1 Network Connection Graphs. 
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Table 3.1 Bisection Width Speed-up Upper Bounds. 

The value r = and the constants c and c are both less than 1. Terms 

preceded by < are exact if r is replaced by r log(p). The prefix « indi- 
cates the best known methods are substantially slower and = means that 
the bound is exact for the appropriate choice of the constants. 


The Fast Fourier Transform. 

The FFT algorithm on a problem of size n can be defined in many ways. 
Here we follow [AHU174]. The algorithm is expressed in terms of a sequence of 
log (ri) permutations. The k 01 term in this sequence is defined on a vector of 
length n, ^f[0;n — l] by the expression. 

&ftyk(X) “ ^*(0). i -^3> fc (n- 1) 

where 

PkU ) = 2* + O' + 2*" 1 mod 2*). 

These "butterfly" permutations are concatenated to form the flow graph of the 
FFT algorithm as shown in Figure 3.2. This butterfly graph has an interesting 

property. If p divides n we can group the columns into p blocks of —adjacent 

columns each. Then the resulting "quotient graph" of the butterfly graph Bfly 
is the butterfly graph Bfly k ^ g{p) . (See [FiFi82], [Degr83] for many other useful 
quotient graph relations). Assuming that inputs X(j-i) r to X }r -\ all reside on pro- 
cessor j for we can write the FFT algorithm as 

FFT( X[0;n-1] ); 

for i = log(n) downto 1 do begin 












Figure 3.2 Butterfly based FFT flow graph 
and equivilent Shuffle implementation. 
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Y = Bflydxy. 

pardo(j = l.p ) 

for k = (j-l)n/p to jn/p - 1 do 

** = + *i.k*Y k -, 

endpar; 

end; 
end ;* 


The values are powers of primative roots of unity that are given in [AHU174] 
and are not of concern here. As index variable i runs from log(n) down to 1, the 
permutation Bfly t can be accomplished using the interprocessor connection 
Blfyi-iogfa/p). For values of i log(n/p) no interprocessor communication is 

Tt 

involved. For the other values of i, there are — data items to be sent and 

n P 

received by each processor, so — communication steps are required. This 

2 n P 

requires time 8 on the Com network. 

P 

It can be shown that the logip) stage butterfly network is topologically 
equivalent to the logip) passes of the Shuf permutation. (The rearrangement is 
shown in Figure 3.2 and a formal proof is given in [ParkBO]). Consequently, on 
machines with Com and Shuf interconnection topologies, the execution time is 
approximately, 

= a^-iog(n) + ft St )log(p). 

P P 


with speed-up 


spm = 


i +r i22l£l 

login) 


To emulate the permutation Bflyie on a ring network requires uniform shifts of 
distance ±2 *' 1 . With each such permutation requiring 2/?2* seconds the speed- 
up is found to be 


o pFFT — 


1 + T- 


log(n) 

On a mesh connected computer the uniform shifts of distance 2* _1 can be done 
in time /32 fc- ^ 7 ^ (see [TbKu77]) and the speed-up is 


spR = — 

1 + T login) 

In the case of the Bing and the Mesh, the speed-up agrees with the theoretical 
upper bound and must be optimal. In the other cases, we feel the algorithm is 
optimal and the speed-up bound is too generous. 

Figure 3.3 depicts the relative efficiencies of the FFT algorithm on the three 
networks described above in the case thatp = 512 and r = 1.0. 


— 1 We have omitted the usual terminating "bit reversal" permutation to simplify the discus- 
:don. Inclusion of this permutation would have only a minor effect on the results here. 
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Tri diagonal Systems 

To solve the tridiagonal system AX - Y we assume that the inputs 
1] are located in processor i for ie[l..p]. We employ the same 
solution strategy as used in section 2. By substructured elimination we reduce 

7t 

the —equations in each processor to 2. This takes time 

12a(“ — 1) 

P 

and involves no interprocessor communication. To solve the reduced system of 
size 2 p let M be a tree of p — 1 processors with the root processor numbered 1 
and the children of the processor numbered 2 i and 2i + l. Assume the p 
pairs of equations are represented asp records e[0;p— 1] of the form 

eqn = record 

a,b,c,y: array[0,l] of real; 

end; 

After an operation of log(p) communication steps we may assume the elements 
of the array e have been stored in the leaves of M with equation-pairs s[2i] and 

e[2i + l] ie[0, ?“~l] stored in processor £-+ i. To solve the system of tridiago- 

nal equation on a tree of processors we use the following basic idea. Each inter- 
nal node of the tree receives a pair of equations from its two decendants giving 
it 4 equations 

b i x U - 1 + “t x h + c i x / 1+l = Vi * = 1.4 

in 8 variables 

*/„■ x h' ' ' ' • x h 

Let elim() be a function called by the tree node that applies the substructured 
elimination computation to a set of four such equations and sends first and last 
of the new set 

b i x io + °i x /i + °i x h = Vi 
b 4 x j 0 + a i x i 4 + c * x j s = y 4 

to its parent node and leaves the other 2 equations stored in the executing pro- 
cessor. To recursively reduce the 2p equations to 2 on the Tree connection we 
apply the function 


function reduce(i: integer): eqn; 
begin 

on processor(i) do 

reduce := if (i >= p/2 ) 

elim(e[2i-p], e[2i-p+l]) 

else 


end; 


elim(reduce(2i), reduce(2i+l)) 


Figure 3.4 illustrates the movement of the variables through the tree. Once the 
2 by 2 system has been solved a "backsolve" procedure can be used to move the 
solutions back to the leaf processors. The backsolve process requires 5 arith- 
metic steps per equation, and somewhat less communication than the forward 
elimination. Observe that this algorithm works with 4 equations per processor 


T 
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while other parallel tridiagonal system solvers need only three equations in each 
process. The advantage of this algorithm is that we can do the communication 
on the Tree while the standard methods require a more elaborate connection 
network. 

Taking both the elimination and the backsolve into account, the execution 
time is found to be: 

TfcZ n (a.,p) = a(17— + 34 log ip) + 9) + 17 0 log(p) 

P 

and the speed-up is approximately 

8 

c_7H_n — 1Z 

* " 1 + (2 + r) S!SRSeL 

71 

8 

This differs from the theoretical upper bound for this problem by the factor 

in the numerator, and by the factor 2 in the communication term. The later is 
due to extra arithmetic caused by "matrix fill" associated with the the sub- 
structed elimination and the constant 2 comes from redundant arithmetic used 
to solve the reduced system. 

To execute this algorithm on the Shuf network we need only embed the 
tree structure in the target graph (as shown in Figure 3.4) and the execution 
time and speed-up will be the same as on the tree. The tree has the property 
that any node can be reached in at most 2 log(p) steps from any other node, 
consequently no direct embedding of the tree is possible in the ring or mesh. 
We do not know of an algorithm for these networks that is within a factor of 
log(p) of the correct communication complexity. 

The tree computation above proceeds as a wave from the leaf nodes to the 
root. If we consider the problem of solving m tridiagonal systems of size n 
(m._jTi_ri) , we can pipeline the method above. The execution time is 

n^(a.« = 17a(^+2(n -»-!)♦ 3l*<p» ♦ 9a ♦ 30*n ♦«*<*» 

This result will be used below. 

Fast Poisson Solvers 

There are a wide variety of interesting direct solvers for the poisson equa- 
tion 

V 2 * = y 

(see [HoJeBl], [Gros79], [SCKu76]). The method here is the easiest to explain. 
Consider a grid of n# by points. The basic fast poisson solver operates in 
three steps on an array of values y. 

1. Apply an FFT to each row of the grid of y values, (y = Eow^FTs (y)). 

2. Solve n systems of tridiagonal equations of size n using the columns of the 
grid of y values as the right hand sides. 

3. Apply an inverse FFT to each row of the solution of the previous step. 

With Dirichlet boundary conditions, the FFTs used here are actually real sin 
transforms which can be computed with the complex FFT algorithm (see 
[HoJe82] for details). Assume we have p processors to execute this algorithm. 
Let A: be a divisor of p and partition the problem so that the columns are divided 

into k groups and the rows are divided into groups. We can then assign each 
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processor to a block with -^—columns and n% — rows. In steps 1 and 3 each row 

u k ^ 

of k processors must compute n“ — FFTs of size n\ Using the algorithms above, 
the minimal execution time for both steps is 

T + iptkoglk). 

P P 

In step 2, each column of processors must solve ^L_ tridiagonal systems of 
size n% As we have seen, this step takes 

t7H ~ 17a (~ + 2lo 9(^) - 25a + 30/9(^-+ log (£-) -l) 

To choose the optimal partitioning of the problem we minimize T 177 * + T 7 * as a 
function of k e[-*7r, m.in(p .n**)]. The function takes the form 

71 '* 

R + 30^ +(4/3^ — 34a — 30/3) log (k ) (3.2) 

where R is independent of*. Minimizing 3.2 as a function of k is, in general, not 
easy. There are two interesting cases to consider. 


Case 1. — — + — . 

P Yi P 2 

In this case the last term in 3.2 is negative, thus we pick * as large as possi- 
ble. If p < then we set * = p which implies that it is best to distribute 
the FFTs across all p processors and to solve the columns of tridiagonal sys- 
tems without communication (^—columns per processor.) If n** or p is 

greater than ~+ y-we And n& <p. Setting k = n% the FFTs are again 
distributed and the execution time is 

T rrs „ + J&. + 3 4«*(^ + + 3 0ioff( ^, 


Case 2. — > 


17 a 


P 

15 

2 


P 2/3 
The optimal value for * is found to be 

2 n% 


k = 7nax( 1, (■ 


-n“H(l + — 5L)W ) 
30 p" J 


15 P 

In the case that ^ P we have * = 1 and the execution time is 

^' FPS = a (~0°i7(p) + -^y-) + 34lop(p)) + 30/3(n^ + log(p )) 

Setting * — 1 implies that each row of the grid is stored completely in one 
processor and the FFTs involve no communication. Consequently, only the solu- 
tion of the tridiagonal systems involve communication and the time bound above 
is valid for a tree network. The resulting speed-up is of optimal complexity. On 
the other hand, the^theoretical lower bound for communication cost for an Ellip- 
tic problem is 0(7^7- for a mesh network and BC^—ior the Shut network. The 
p n P 

time estimate above suggests that this form of the Fast Poisson Solver is subop- 
timal for these networks. 

Based on the upper bounds for speed-up in table 3.1, Figure 3.5 illustrates 
the relative performance of optimal Fast Poisson Solvers as a function of 
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problem size (n) when p = 512 and r = l for the three network topologies Rine 
Mesh and Shuf. 6 ‘ 

Multi-Grid iterative methods (Jran81, GaVrB2] are structured so that each 
stage of the iteration reduces a n» by n# problem to a problem of smaller size 
which is solved by a direct method. Let the initial problem be distributed such 

that square subgrid of size (^-)^ by (jj-)# are mapped to each processor in a 

mesh. The communication cost to reduce the original problem to one of size p* 

by p * is for some constant C x . Using the FPS method (case 1) to solve 

the reduced problem on the mesh requires an additional communication 

steps. Applying K such iterations will reduce the error to a fixed level and cost 

mc x ^- + c&K) 

which is of the optimal complexity. Other methods, such as the preconditioned 
conjugate gradient can also be shown to have this communication bound. On 
the other hand, the authors know of no method that achieves the lower bound of 

^ or ^be networks with high bandwidth. 

4. Multistage Packet Switched Networks 

The analysis of non-shared memory multiprocessors to this point assumed a 
machine M with a fixed or switchable connection topology G(M). The only com- 
munication cost in this model was the tune /? taken by the processors to per- 
form sends and receives. For most non-shared memory multiprocessors, this 
model is probably quite reasonable. 

However, this model corresponds poorly to machines with multistage 
packet switched networks. An example of such a network is the Omega network 
of Lawrie [Lawr75]. With this type of network, messages can be broken into 
packets of uniform size where each packet consisits of a destination tag and a 
data field. The switches in the network read the destination tag and forward the 
packet along its route. An Omega network interconnection of p processors con- 
tains log(p) stages, each having switches. In the best case, a packet can be 

routed through the network in log(p) steps. However, when the network has 
heavy traffic, contention occurs and contending packets must be queued at the 
switches. Simulation results suggest packets are delayed by an average amount 

Colog (p) 

and that the number of packets transmitted per clock cycle is about 

Cip 

for any number of processors p. Thus the Omega network behaves much like a 
crossbar switch, except message propagation delays are relatively long 
[KrSn82], [GoSc82]. Performance of other log(p) stage packet networks, such as 
the Banyan network, baseline network, and so on, is comparable to that of the 
Omega network. 

In principal, packet switched networks can be accommodated in our mul- 
tiprocessor model by treating switches as specialized processors. Though feasi- 
ble, this approach is difficult, since the communications patterns in packet net- 
works are complex. A more illuminating approach is to view the packet switched 
network as a close emulation of a crossbar switch, and modify our multiproces- 
sor model accordingly. The model given at the beginning of this section needs to 
be modified only in the two provisions governing communication: 
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3 Each arithmetic operation takes a seconds. Transmission or receipt of a 
word of data takes 0 seconds. Receipt of a message can be done y seconds 
after its transmission or any time thereafter. 

4’ The connection topology is a complete graph. 

The delay parameter y here is designed to model the time taken to route 
and forward messages in the network, and the time packets spend queued at 
switches when there is contention. With this model, sending a one word message 
between two processors takes total time 2/3 + y. In sending a k word message, k 
sends and receives are required. But the propagation delays can be overlapped, 
so after receiving the first word, a new word can be received every /? seconds, 
giving a total message delay of: 

(* + l)/3 + y. 

With this model of a multiprocessor interconnected by a packet switching 
network, it is possible to look at any of the algorithms already considered. We 
look here at fast Fourier transforms, since there are interesting aspects of this 
algorithm not yet treated. The communications required in an FFT can all be 
viewed as permuting data between processors. Suppose we have n words of 

data, with —words per processor. Then we can ask, how much time is required 

to simultaneously move the data on every processor to some other processor. 
Let the time taken to perform this operation be denoted by Xp(n). 

To compute the value of Xp(n), note that to move —data words from one 

P 

processor to another should take time 

■(jr+iw+7- 

as discussed above. But this will be so only if the target processor is ready to 
receive the data as soon as it arrives there. In permuting n words of data, each 

processor must send —and receive —words, so the execution time for this per- 
mutation cannot be less than: 



In fact, the time perform this permutation is 

Xpin) = max[(^)J, (^-+ 1)0 + y] ( 4 . 1 ) 

as one can easily verify. 

Now consider the problem of performing an FFT on a vector of length n, with 
this multiprocessor model. The execution time will be: 

T Fn * = a(^log(n) + X p (n)log(p) 

~ a (~ 1 l0 9 (n) + max[(~)/S. (j- + l)/3 + ?]log(p) 

This is exactly the same as the execution time derived previously, except for the 
new term involving y. Notice here that the vector length n does not multiply y, 
so the impact of a large y, caused perhaps by packet contention, is not as severe 
as one might expect. 

Next consider the problem of performing multiple fast Fourier transforms. 
In treating fast Poisson solvers, it was implicitly assumed that to take fast 
Fourier transforms of m vectors, one would just repeat the parallel algorithm 
for a single FFT m times. This is not necessarily the best approach, especially 
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on packet switched networks where one needs to contend with propagation 
delays. At least four reasonable approaches to performing m fast Fourier 
transforms can be found: 

1 Repeat the parallel algorithm FFT_p, for a single data vector, m times. 

2 Combine m invocations of FFT_p to overlap communication. Each step of 
the FFT would be performed on all m data vectors at once before proceed- 
ing to the next step. 

3 If m £ p and m \ p, the data can be permuted so data vector i resides on 

processors (i-1) 1 through Then the algorithm FFTji for a single 

data vector can be performed on each block of - 2 - processors. Finally the 

771 J 

results must be permuted back to their correct locations. 

4 If p and p \ m, the data can be permuted so each data vector resides 
on only one processor. Each processor then performs sequential FFTs on 

tbe - data vectors it has, and finally the results are permuted back to 
their correct locations. 

Now looking in detail at each of these, for the first approach, the execution 
time is just m times the execution time of FFT_p. That is: 

T m~FFT.n - a (^2HL )i og ( n ) + m X p (n)lag (p) 

With the second approach, one will have only log(p) communication steps 
rather than mlog(p) as in the first approach. However, each step is now a per- 
mutation on mn words of data rather than on n words as in the first approach. 
The execution time is thus: 

= „( 22 ^ 1 0S ( n ) + ^( mn)Ios(p) 

At first sight there appears to be little difference between the two approaches. 
However, the function Xp(n) satisfies the inequality 

Xp(mn ) ^ mXp(n) 

for all m, so the second approach is always at least as good as the first 
approach. 

The third approach here is somewhat more complex. Two operations are 
involved, permuting the data, so each of the m vectors is distributed over a 

block of ^-processors, and then performing FFTs on these processor blocks. 
The FFT algorithm needed here is just the FFT for a single data vector, FFT_p, 
already studied, except only -^-processors are used now. The execution time to 

771 

perform these FFTs on processor blocks is: 

T?* = a(jj^)log(n) + X p (nm)log(2^ 

where we have used the identity X p/m (n) = X p (mn) which is easily derived 
from equation 4.1. 

The other operation needed is permuting the m data vectors. Each data 
vector is originally distributed evenly over the p processors, and must be moved 

so it is distributed over a block of -2-p rocessorSt The cos £ j s; 
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Print n IlldX 


277m / p — 1 , 


- mn , v -1 


+ 1 )/? + 7 


The factors 2— arise since a fraction of each vector is already in the proper 
processor memory and does not need to be transmitted. Assuming p is large, 
p — is close to unity and we cam set: 

Tdata ~ X p (mn) 

The data needs to be permuted before and after performing the FFTs, so the 
total execution time becomes 

<pm^FFLn - a( ™ n )log(n) + X p (mn)(log(p) - log(m) + 2) 

Analysis of the fourth algorithm is similar. No communication is involved in 
the FFTs in this case, but data permutations are required before and after the 
FFTs. The execution time is thus: 

IT-*™* = a(22^log(n) + zx p (mn) 

One way to compare these four algorithms for computing m FFTs is to com- 
pute their speed-ups. The results are: 





S* = 


Comparing these equations it is clear that the first algorithm is never 
better than the second, as already mentioned, since the impact of y is smaller in 
l he second. Note that this conclusion applies only for the packet switched net- 
works under consideration. For networks with fixed or circuit switched topology 
these two algorithms perform identically. 

Figure 4.1 illustrates the efficiency in the case that a = /? = 1.0 and y - 50.0, 
P ~ 512 and 7i = 1024. In this case we have plotted the preformance as a func- 
tion of the number of equations, m. Observe that the third algorithm becomes 
better than the second when 4. Had we included the cost of the "bit rever- 
sal permutation in all algorithms, the third algorithm would have become 
better even earlier. Between the third and fourth algorithms there is nothing to 
decide, since the third applies only to the case m £ p and the fourth to the case 
m s* p. Figure 4.1 depicts these two methods as one with a transition at m = p. 




Figure 4.1 Efficiency as a function fix for 4 methods to implement 
algorithms for doing m FFTs of size n. p = 512, n = 1024, a = £ = 1.0 and 7 = 50.0. 
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is fh P ; gh f i° r optimal al g° r ithms is interesting, the real issue here 

is the impact of the delay y caused by the use of a packet switching network. 

n h p , ffect ot y depends on the ratio of problem size to the number of proces- 

,. h _ ’ fl ° n pr °£ len *s a 8 reat deal of computation, y is well masked. In fact in 
the three FFr algorithms for multiple data vectors found to be best, y always 
enters the execution time in the ratio: 


Zmn a 

WaS performed only for FFT algorithms, experience sug- 

norfpni h 1 ^ f elays caused by packet switched networks are relatively unim- 
portant on most compute bound problems. 

5. Conclusion 

, J^ is paper has considered three basic families of multiprocessors and the 
analysis of communication complexity in the algorithms for these architecture 
® “ S r e . S . 1116 principal goal here was to look at communication and its impact on 
algorithm performance. For large shared memory multiprocessors analyzing 
communication turns out to be relatively straight forward. The main issues are 

I S Us efl'e y ct and fi " dinf! ’ rayS l ° ° rganiZe ° r SUb3trUcture P roblema to 

Studying algorithms on non-shared memory machines is more difficult 
since the topology of the communication network is a central issue. Our 
analysis of non-shared memory network based machines was divided into two 
parts, the first covering machines with a fixed or circuit switched topology, the 
second covering machines based on packet switched networks. 

On circuit switched machines, techniques borrowed from VLSI complexity 
theory provide a nice tool for obtaining lower bounds on algorithm complexity 
Given an interconnection topology, one can with relative ease compute upper 
bounds on efficiency of the problem solution. An important point here is that 
these are upper bounds on the problem, (e.g. FFT. Fast Poisson Solve, direct 
solution of tridiagonal systems) not on any particular implementation of an algo- 
lthm for solving the problem. In the cases studied, these upper bounds are 

apparantl f q . ulta tl gbt; in two of the three cases studied these upper bounds are 
actually attained. 

By contrast, analysis of algorithms on machines interconnected by a large 
packet switching network is far easier, given our simple model of the behavior of 
packet switching networks. Here the propagation delay parameter y, modeling 
he impact of packet contention, is quite important. But on most large prob- 
lems it seems to be possible to substructure the problem so that the effect of y 
is minor. With our model of a packet switched network, in which such a network 
is bleated as a crossbar switch with delay, analysis of algorithms is no more 
difficult than for shared memory multiprocessors. (In fact, the delay y is closely 
related to the value n$.) At the moment this model rests only on heuristic con- 
siderations and simulation results, so it would be valuable to establish the pre- 
cise circumstances under which it holds. 

Many important problems remain to be solved. In particular, improved 
techniques are needed for lower bounds on communication in multiprocessors, 
n the case of specific algorithms, we do not know of better lower bounds (or 

better algorithms) in the case of elliptic PDFs on high bandwidth networks like 
tne onuf connection. 
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Of particular importance are issues that were not considered at all in this 
paper. This includes a systematic approach to algorithms with dynamic data 
structures, such as adaptive grid algorithms for PDEs. Do these problems have a 
reasonably nice solution on nonshared memory systems? If so, what is the 
structure of the communication? A closely related problem is the analysis of 
complexity of communication in Data Flow machines. How does it differ from 
the models we have surveyed in this paper? 
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